Prognosing post-treatment outcomes of head and neck cancer using structured data and machine learning: A systematic review

Background This systematic review aimed to evaluate the performance of machine learning (ML) models in predicting post-treatment survival and disease progression outcomes, including recurrence and metastasis, in head and neck cancer (HNC) using clinicopathological structured data. Methods A systematic search was conducted across the Medline, Scopus, Embase, Web of Science, and Google Scholar databases. The methodological characteristics and performance metrics of studies that developed and validated ML models were assessed. The risk of bias was evaluated using the Prediction model Risk Of Bias ASsessment Tool (PROBAST). Results Out of 5,560 unique records, 34 articles were included. For survival outcome, ML models outperformed the Cox proportional hazards model in time-to-event analyses for HNC, with a concordance index of 0.70–0.79 vs. 0.66–0.76, and for the sub-sites with comparable data, namely the oral cavity (0.73–0.89 vs. 0.69–0.77) and larynx (0.71–0.85 vs. 0.57–0.74). In binary classification analysis, the area under the receiver operating characteristic curve (AUROC) of ML models ranged from 0.75 to 0.97, with an F1-score of 0.65–0.89 for HNC; AUROC of 0.61–0.91 and F1-score of 0.58–0.86 for the oral cavity; and AUROC of 0.76–0.97 and F1-score of 0.63–0.92 for the larynx. Disease-specific survival outcomes showed higher performance than overall survival outcomes, but the performance of ML models did not differ between three- and five-year follow-up durations. For disease progression outcomes, no time-to-event metrics were reported for ML models. For binary classification of the oral cavity, the only evaluated subsite, the AUROC ranged from 0.67 to 0.97, with F1-scores between 0.53 and 0.89. Conclusions ML models have demonstrated considerable potential in predicting post-treatment survival and disease progression, consistently outperforming traditional linear models and their derived nomograms. Future research should incorporate more comprehensive treatment features, emphasize disease progression outcomes, and establish model generalizability through external validations and the use of multicenter datasets.


Introduction
Head and neck cancer (HNC) is the seventh most common cancer globally, accounting for more than 660,000 new cases and 325,000 deaths annually. By 2030, the incidence of HNC is expected to rise by 30% compared to the 2020 rate, largely driven by increases in oropharyngeal cancer [1,2]. Post-treatment recurrences and metastases are common occurrences that contribute to poor prognosis of HNC [3,4]. The five-year relative survival rate of HNC has improved during the past decades from 54.1% (1975-84) to 66.8% based on the Surveillance, Epidemiology, and End Results (SEER) data [5]. Despite the increase in survival rates, per capita death rates have risen over the last decade, reflecting the predominance of the increase in incidence over the survival rate [2]. Squamous cell carcinoma (SCC) is the most common type of HNC, constituting 90% of the cases and attracting considerable research aimed at enhancing diagnostic, prognostic, and therapeutic interventions [6].
Traditional prognostication of HNC outcomes primarily relied on nomograms and linear models that took into account factors such as the primary tumor size and extent, lymph node involvement, and the presence of distant metastasis [3]. However, this approach inadequately addressed the inherent heterogeneity among HNC patients, leading to less accurate individual risk assessments. In response, recent models have incorporated a more diverse range of prognostic variables, including patient demographics, histopathological information, treatment details, comorbidities, and molecular markers [7,8]. Concurrently, machine learning (ML) has emerged as a promising tool, leveraging its capacity for non-parametric modeling to analyze extensive and intricate datasets more flexibly. This approach allows for the accounting of non-linear relationships and interactions between predictors, offering a more nuanced understanding of the data [9,10].
Although there has been a shift toward including unstructured data such as medical images in the prognosis of HNC outcomes, utilizing structured data such as tabulated clinicopathological features offers several advantages. Structured data is more amenable to systematic organization and analysis, whereas unstructured data requires extensive preprocessing to convert it into a format suitable for analysis [11,12]. Recent advancements in deep learning (DL) have reduced the need for extensive preprocessing by substituting it with the requirement for large datasets for model development [13]. However, the scarcity of large, suitable datasets for training presents a challenge to this approach. Structured data typically comprises variables with established clinical relevance, leading to well-specified models with greater interpretability. This characteristic also makes structured data more consistent across different institutions. Unstructured data, by contrast, includes formats like text, images, audio, and video, which are not easily searchable or analyzable using conventional data processing techniques.
2. used structured data other than clinicopathological such as biochemical markers, molecular, and genomic data; 3. reported on prognostic outcomes other than survival and disease progression; 4. only reported on validation and not the development of models.

Focused PICO-TS question
What is the predictive performance of ML models, developed based on clinicopathological data, in informing the survival and disease progression of HNC patients?
Participants (P): HNC patients who received treatments.
Intervention (I): Application of ML models in predicting post-treatment survival and disease progression outcomes.
Comparison (C): ML models' predictive performance compared to actual events and/or traditional linear models. Examples of traditional models include logistic and Cox proportional hazards (Cox PH) regressions.
Timeframe (T): The review will include both retrospective and prospective studies that report outcomes based on data collected from patients over varying follow-up periods to ensure the analysis reflects long-term prognostic performance. We will specify the follow-up durations for each included study in the review results.
Settings (S): The review will include studies conducted in various clinical settings, such as hospitals, cancer research centers, and academic institutions. The data will be primarily sourced from electronic health records (EHRs), which are routinely used in clinical practice.

Selection of studies
A two-stage screening (title-abstract, full text) was carried out by two authors independently (MM, PAZ). Title management was performed by a commercially available software program (Covidence systematic review software, Veritas Health Innovation, Melbourne, Australia). The duplicates were removed within and between the databases. The full texts of potential articles were retrieved and evaluated using an eligibility form. Any disagreements on the selection of studies were discussed and resolved. The reasons for excluding articles not meeting the eligibility criteria were reported.

Data extraction
Using a predesigned data extraction form, the following information was extracted from the papers that met the eligibility criteria: title, authors' names, authors' affiliations, database, year of publication, population, sample size, case-to-control ratio, outcome measure, tumor site, tumor histology, clinicopathological features, ML models, dimensionality reduction, feature selection, resampling techniques, class imbalance correction, validation (i.e., internal vs. external), performance metrics, traditional model comparator, and authors' conclusion.

Risk of bias
The systematic review assessed the quality of the included studies using the Prediction model Risk Of Bias ASsessment Tool (PROBAST) [17]. This tool, tailored for evaluating diagnostic and prognostic models, examines bias risk across four domains: participants, predictors, outcomes, and analysis. The evaluation employed 20 signaling questions, categorizing the risk of bias as low, unclear, or high. Applicability concerns for the initial three domains were similarly classified as low, unclear, or high. An overall low risk of bias was determined if all domains received a low rating. However, for prediction models developed without external validation, even if all domains were rated low, the risk of bias should be considered high unless the model was developed on a very large dataset and included internal validation [18]. Two authors (MM, PAZ) independently conducted the assessments, with cross-checking to ensure accuracy and consistency.

Data analysis
A qualitative methodology was employed to interpret and summarize the findings from the selected studies. This synthesis encompassed study characteristics, clinicopathological features, ML models, validation methods, and performance metrics. Subgroup analyses of the results were adapted based on the tumor site. Specifically, the following sites were considered separately: oral cavity, oropharynx, nasopharynx, hypopharynx, and larynx. When a specific subsite, such as the tongue, was investigated, it was categorized under a broader category, such as the oral cavity. If a study combined different sites without reporting performance for each site separately, it was categorized as HNC. The synthesis of the results was performed when at least two studies were available for each site. When feasible, and if not reported, metrics such as the F1-score were derived from other metrics such as recall and precision. For binary classification tasks that did not incorporate time modeling, our report primarily focuses on AUROC as a threshold-independent metric and the F1-score as a threshold-dependent metric, provided these were reported. In cases where studies conducted time-to-event analyses, considering both time modeling and censored data, the primary performance metric employed was the C-index. In addition to discrimination metrics, calibration metrics were reported if they were assessed in a study. While all performance metrics were documented, only those derived from test sets (either internal or external) were utilized for synthesis. The discussion includes an analysis of the limitations and trends observed across the studies, providing insights into the current state of ML models used in this domain.
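As an illustration of the two metric computations mentioned above, the following is a minimal sketch rather than the procedure used in any included study. It derives the F1-score from precision and recall and computes a concordance index for right-censored data. The choice of Harrell's formulation of the C-index is an assumption, since the review does not state which variant each study used; the function names and the example values are hypothetical.

```python
from typing import Sequence


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def harrell_c_index(
    times: Sequence[float],
    events: Sequence[int],
    risks: Sequence[float],
) -> float:
    """Harrell's C-index (assumed variant) for right-censored data.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] == 1) strictly before subject j's recorded time.
    The pair is concordant when subject i also has the higher predicted
    risk; tied risks count as half a concordant pair.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical example: four patients, the third one censored at t = 6.
example_f1 = f1_from_precision_recall(0.8, 0.6)
example_c = harrell_c_index(
    times=[2, 4, 6, 8], events=[1, 1, 0, 1], risks=[0.9, 0.7, 0.8, 0.1]
)
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a sketch but would normally be replaced by a library implementation for large registries such as SEER.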

Assessment of methodological quality
Table 1 outlines a comprehensive assessment of the risk of bias and applicability concerns. Among the 34 studies evaluated, six were assessed as exhibiting a low overall risk of bias, 24 exhibited a high risk, and four exhibited an unclear risk of bias. In the participant domain, all studies were characterized by a low risk of bias except for one with unclear risk. Concerning the predictor domain, an unclear risk of bias was identified in three studies, whereas the remaining studies were considered to have a low risk. For the outcome domain, one study was identified as high risk, four had an unclear risk, and the remainder were categorized as low risk. The most significant source of bias emerged in the analysis domain, with 24 studies classified as high risk. The primary contributors to this high-risk rating were small sample sizes, which resulted in a ratio of participants with the outcome to the number of candidate predictors of less than 10. Additional contributing factors included a lack of cross-validation or a test set, not accounting for complexities in data such as censoring or competing risks, the utilization of univariate analyses for feature selection, and failure to report performance metrics for both discrimination and calibration. In the analysis domain, four other studies exhibited an unclear risk of bias, while only six were evaluated as having a low risk. Regarding applicability concerns, all studies demonstrated a low risk across all domains (Fig 2).
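The sample-size criterion described above can be expressed as a simple ratio check. The sketch below is illustrative only: the function names and the example counts are hypothetical, and the threshold of 10 is taken from the text rather than from any study-specific calculation.

```python
def events_per_variable(n_events: int, n_candidate_predictors: int) -> float:
    """Ratio of participants with the outcome to candidate predictors."""
    if n_candidate_predictors <= 0:
        raise ValueError("at least one candidate predictor is required")
    return n_events / n_candidate_predictors


def flags_small_sample(
    n_events: int, n_candidate_predictors: int, threshold: float = 10.0
) -> bool:
    """True when the ratio falls below the threshold (10 in the text above)."""
    return events_per_variable(n_events, n_candidate_predictors) < threshold


# Hypothetical study: 45 recurrences and 12 candidate predictors.
ratio = events_per_variable(45, 12)
flagged = flags_small_sample(45, 12)
```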

Study characteristics
The characteristics of the included studies are summarized in chronological order in Tables 2 and 3 for the survival and disease progression outcomes. The publication dates ranged from 2015 to 2024, with only three studies published before 2019. Out of the 34 studies analyzed, 15 were based on the US population data, with 10 of those studies being conducted by non-US institutions using SEER datasets. The sample sizes ranged from 145 to 177,714. Of the 34 studies, 24 were on survival outcomes while 11 were on disease progression; one study covered both survival and disease progression. For the survival outcome, 11 studies reported the performance of ML models for tumors located in the oral cavity, eight for the larynx, five for the oropharynx, three for the hypopharynx, and two for the nasopharynx; in addition, six studies reported the performance of ML models for HNC without differentiating between sites. Regarding disease progression outcomes, nine studies targeted the oral cavity, while one study focused on the larynx and another on HNC as a whole.
According to Table 4, 11 out of 24 studies conducted time-to-event analyses for the survival outcome. In contrast, 14 studies performed binary classification, and one study carried out a regression task. Also, six out of 24 studies performed calibration besides discrimination analyses. Referring to Table 5, for the disease progression outcome, only one study undertook time-to-event analyses, whereas the remaining 10 studies engaged in binary classification. Considering Tables 4 and 5 for both survival and disease progression outcomes, 12 studies implemented dimensionality reduction or feature selection, and eight studies addressed class imbalance through correction techniques. Concerning validation methods for the developed models, four studies utilized external validation, two applied temporal validation, and the remaining studies depended on internal validation.

Performance of models for survival outcome
3.4.1. Time-to-event models. For HNC as a whole, three studies reported C-indices ranging from 0.70 to 0.79 for ML models; in the same studies, the C-indices for Cox PH models ranged from 0.66 to 0.76 [66,68,75]. For the oral cavity, the C-index for four studies conducting time-to-event analyses ranged from 0.73 to 0.89, with three of these studies also reporting on Cox PH models, showing C-indices between 0.69 and 0.77 [60,71,72,75]. For the larynx, five studies reported C-indices for ML models ranging from 0.71 to 0.85, and for Cox PH models in the same studies, ranging from 0.57 to 0.74 [72,75,78,80,86]. For the oropharynx, three studies reported C-indices ranging from 0.77 to 0.80 [72,73,75], but there was not enough data to consolidate the C-index for Cox PH models. For the hypopharynx and nasopharynx, only two studies each reported C-indices for ML models, ranging from 0.72 to 0.79 [72,75] and from 0.72 to 0.83 [76,77], respectively. The C-index for Cox PH could not be consolidated for the hypopharynx and nasopharynx.
Among the ML models, the Random Survival Forest (RSF) was employed in nine studies and compared with other ML models in seven of these studies. RSF outperformed the other models in three of these comparative studies [66,72,75]. Among the other four studies, two incorporated DeepSurv, which demonstrated superior performance compared to RSF [60,71]. In the remaining two studies, a deep neural network model [80] and a survival SVM [77] outperformed RSF. Furthermore, DeepSurv excelled in all three studies in which it was compared to other models. The detailed performance metrics for the best-performing models can be found in Table 4.

3.4.2. Binary classification models.
For HNC as a whole, four studies reported AUROC ranging from 0.75 to 0.97, and three of these studies also reported F1-scores ranging from 0.65 to 0.89; based on three of these studies, AUROC for logistic regressions ranged from 0.71 to 0.84, and based on two studies, F1-scores ranged from 0.54 to 0.77 [67,68,75,76]. For the oral cavity, seven studies reported AUROC and F1-score for ML models, ranging from 0.61 to 0.91 and 0.58 to 0.86, respectively, with four of these studies also reporting on logistic regression models, which ranged from 0.52 to 0.69 for AUROC and 0.57 to 0.62 for F1-score [56,58,63,75,76,87,88]. For the larynx, four studies reported AUROC for ML models ranging from 0.76 to 0.97, and three of these studies also reported F1-scores ranging from 0.63 to 0.92; from these four, two studies also reported on logistic regression with AUROC ranging from 0.76 to 0.92 [75,76,81,85]. For the oropharynx, three studies reported AUROC for ML models ranging from 0.93 to 0.97 and F1-scores ranging from 0.90 to 0.92, but there was not enough information to consolidate results for logistic regression [74,75,76]. Moreover, the AUROC of ML models ranged from 0.77 to 0.85 for two studies that reported on the hypopharynx [75,82].
The results for the nasopharynx could not be synthesized since only one study reported on the related metrics. Among ML models, tree-based models were used more often and generally demonstrated superior performance in terms of AUROC and F1-scores compared to other algorithms. In nine studies that conducted comparative analyses, six found that tree-based models, including random forest and XGBoost, outperformed others [58,63,76,82,85,87]. However, in one study, a voting classifier of random forest, logistic regression, and Gaussian Naïve Bayes was superior [88]. In another study, a Support Vector Machine (SVM) excelled [81], and in one study, neural networks showed superior performance [67]. The detailed performance metrics for the best-performing models can be found in Table 4.

Performance of models for disease progression outcomes
3.5.1. Time-to-event models. The only study that performed time-to-event analyses did not report a C-index for the RSF model, but it was 0.60 for Cox PH [83].
3.5.2. Binary classification models. Overall, out of 11 studies, nine studies reported on the performance of ML models on the oral cavity, of which five studies reported AUROC values ranging from 0.67 to 0.97 [59,64,65,79,84]. Among these, three studies also evaluated traditional linear models, with the AUROC for corresponding ML models ranging from 0.67 to 0.88, while for linear models including logistic regression, it ranged from 0.68 to 0.73 [64,79,84]. Additionally, six studies provided F1-scores for ML models, with values ranging from 0.53 to 0.89 [57,59,62,69,79,84]. Of these, three studies included F1-scores for linear models as well, with the corresponding ML models' F1-scores ranging from 0.53 to 0.89 and linear models' scores from 0.30 to 0.87 [69,79,84]. Besides the oral cavity, other sites either had no or only one related study, so their results could not be consolidated.
For the disease progression outcomes, unlike the survival outcomes, there was not a consistent trend indicating superior performance of tree-based models. The specific performance metrics for the top-performing models are detailed in Table 5.

Discussion
This systematic review assessed the utilization and performance of ML models in predicting post-treatment survival and disease progression in HNC patients, based on structured clinicopathological data. ML models consistently surpassed traditional models, including Cox PH and logistic regressions, as well as nomograms derived from these linear models. Furthermore, the results indicated that, in time-to-event analyses, ML models demonstrated superior performance for specific HNC subsites, such as the oral cavity, oropharynx, and larynx, compared to overall HNC. This discrepancy may be attributed to the distinct risk factors, etiology, tumor biology, incidence, treatment strategies, and prognosis associated with each subsite [89].
The enhanced performance of ML models in modeling time and censored data for the survival outcome of different sites, including HNC as a whole (C-index range: 0.70 to 0.79), oral cavity (C-index range: 0.73 to 0.89), and larynx (C-index range: 0.71 to 0.85), compared to traditional models such as Cox PH for HNC (C-index range: 0.66 to 0.76), oral cavity (C-index range: 0.69 to 0.77), and larynx (C-index range: 0.57 to 0.74), can be attributed to several factors. Primarily, ML models are non-parametric, meaning they do not presuppose a specific form for the data and can also identify complex non-linear relationships between variables, which is beyond the capability of traditional linear models. Additionally, models such as Cox PH presume that the risks associated with different individuals are proportional over time, suggesting that at any given time point, a specific individual will always have a higher or lower risk than another, implying that their survival curves will never intersect, a premise not always valid in real-world settings. Moreover, multicollinearity is not an issue for most ML models, and they can process numerous features and their interactions, providing them with an intrinsic ability to dissect population-level incidence more effectively by considering a broader range of individuals' features, thereby offering superior personalized risk predictions.
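The proportional hazards premise discussed above can be made explicit with the standard textbook form of the Cox model. The notation below (baseline hazard $h_0$, coefficient vector $\beta$, covariate vectors $x_i$ and $x_j$) is introduced here for illustration and does not appear in the reviewed studies.

```latex
% Cox PH model: the hazard for an individual with covariates x
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right)

% The hazard ratio between individuals i and j does not depend on t
\frac{h(t \mid x_i)}{h(t \mid x_j)}
  = \exp\!\left(\beta^{\top}(x_i - x_j)\right)

% Consequently the survival curves are ordered and cannot cross
S(t \mid x) = S_0(t)^{\exp(\beta^{\top} x)}
```

Because the ratio is constant in time, whichever individual has the larger linear predictor $\beta^{\top} x$ has the higher hazard, and hence the lower survival probability, at every time point.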
A variety of ML models were utilized for time-to-event analyses of the survival outcome. Among these models, DeepSurv emerged as the top performer. DeepSurv, a deep feed-forward neural network, outperformed the other models in every study in which it was employed [60,71,73]. DeepSurv implements a deep learning generalization of the Cox proportional hazards model and has an advantage over traditional Cox PH because it does not require a priori selection of covariates, but rather learns them adaptively. It has the capacity to model complex nonlinear relationships between a patient's covariates and the hazard function [90]. In the absence of DeepSurv, RSF showed the best performance in three studies [66,72,75]. RSF's ensemble approach, combining multiple decision trees, enhances predictive accuracy, while its provision of variable importance measures offers valuable insights into influential factors affecting outcomes. RSF's minimal preprocessing requirement is particularly beneficial in the medical field. Moreover, it adeptly handles non-linear relationships. Importantly, RSF allows for individual hazard functions to intersect, accommodating non-proportional hazards, which contrasts with models like DeepSurv that assume proportional hazards across individuals [91].
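To make the link between DeepSurv and Cox PH concrete, the sketch below computes the average negative Cox partial log-likelihood from a vector of predicted log-risk scores. It assumes that this is the core training objective of a Cox-type network; the network itself, any regularization term, and refined handling of tied event times are omitted, and the function name and example data are hypothetical.

```python
import math
from typing import Sequence


def neg_partial_log_likelihood(
    scores: Sequence[float],
    times: Sequence[float],
    events: Sequence[int],
) -> float:
    """Average negative Cox partial log-likelihood.

    scores are predicted log-risk values (for Cox PH, the linear
    predictor; for a Cox-type network, the network output).
    For each observed event i, the risk set contains every subject
    whose recorded time is at least times[i].
    """
    total = 0.0
    n_events = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored subjects contribute only through risk sets
        risk_set_sum = sum(
            math.exp(scores[j]) for j in range(len(times)) if times[j] >= times[i]
        )
        total += scores[i] - math.log(risk_set_sum)
        n_events += 1
    if n_events == 0:
        raise ValueError("at least one observed event is required")
    return -total / n_events


# Hypothetical data: three patients, all with observed events.
times = [1.0, 2.0, 3.0]
events = [1, 1, 1]
loss_good = neg_partial_log_likelihood([2.0, 1.0, 0.0], times, events)
loss_bad = neg_partial_log_likelihood([0.0, 1.0, 2.0], times, events)
```

Scores that rank earlier deaths as higher risk give a lower loss, which is the behavior a gradient-based learner would exploit.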
In the binary classification of survival outcomes, the observed trends paralleled those in time-to-event analyses, with ML models outperforming linear models such as logistic regression. A broad spectrum of ML models was utilized for binary classification, encompassing simple neural networks, deep learning, SVMs, Naïve Bayes, K-nearest neighbors, gradient boosting, and various tree-based models. Among these, tree-based models, particularly random forest and XGBoost, were found to outperform other models in six out of nine studies that conducted such comparisons [58,63,76,82,85,87]. However, there were instances where other algorithms were superior to tree-based models. For example, in one study, an SVM, distinguished by its capacity to identify the optimal margin separating different classes, particularly in high-dimensional spaces, exhibited superior performance [81]. Additionally, in another study, an ensemble voting model comprising random forest, logistic regression, and Gaussian Naïve Bayes excelled. This model considerably enhanced predictive accuracy by amalgamating the strengths of multiple models to reduce individual errors and variances [74]. This indicates that while tree-based models are favorable, there is a necessity to explore the potential of other ML models further.
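The idea behind a voting ensemble can be sketched as follows. This example assumes soft voting (averaging predicted probabilities), which may differ from the voting scheme used in the cited study; the three probability lists stand in for the outputs of already-fitted random forest, logistic regression, and Gaussian Naïve Bayes models and are hypothetical values.

```python
from typing import List, Sequence


def soft_vote(
    probabilities_by_model: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> List[int]:
    """Average each sample's predicted probability across models,
    then label it 1 when the mean reaches the threshold."""
    n_models = len(probabilities_by_model)
    if n_models == 0:
        raise ValueError("at least one model is required")
    n_samples = len(probabilities_by_model[0])
    labels = []
    for k in range(n_samples):
        mean_prob = sum(model[k] for model in probabilities_by_model) / n_models
        labels.append(1 if mean_prob >= threshold else 0)
    return labels


# Hypothetical predicted probabilities of the event for three patients.
random_forest = [0.90, 0.40, 0.20]
logistic_regression = [0.60, 0.55, 0.10]
naive_bayes = [0.70, 0.30, 0.45]
ensemble_labels = soft_vote([random_forest, logistic_regression, naive_bayes])
```

For the second patient, logistic regression alone would predict the event, but the averaged probability stays below 0.5, illustrating how the ensemble dampens the error of a single model.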
In the examined research, two studies developed predictive models for both overall survival and disease-specific survival [71,72]. Peng et al. observed that for the oral cavity, the performance of the RSF model was superior for disease-specific compared to overall survival based on the C-index score (0.84 vs. 0.80). Another study by Adeoye et al. identified a consistent pattern showing that for the oral cavity, the C-index for disease-specific survival was higher than for overall survival (0.89 vs. 0.77). A possible explanation for these findings is that disease-specific survival only considers deaths attributed to the disease being studied, thus providing a more focused measure of a treatment's effectiveness against the disease. In contrast, overall survival includes all causes of death, which can introduce more variability and potential confounding factors that could affect the model's performance. Additionally, based on Table 2 and the chosen features, these results might be due to selecting predictors that are more tailored to disease-specific survival than to overall survival.
The performance of the predictive models exhibited minimal variations across different follow-up durations. In the analysis of three studies comparing three- and five-year survival rates, two studies reported a marginally higher C-index for three-year survival (0.77 vs. 0.76 and 0.82 vs. 0.79) [66,78], while another study, which utilized AUROC as a metric, found that the performance of ML models was negligibly higher for five-year vs. three-year survival (0.63 vs. 0.61) [88]. These observed discrepancies may be related to the evolving nature of the diseases under study and the application of static variables at baseline in the models. While certain variables, such as patient socioeconomic status and health behaviors, may alter over time, the slight change in model performance suggests that baseline variables maintain consistent predictive value for extended follow-up periods compared to shorter ones.
In the analysis of disease progression outcomes among the 11 studies reviewed, none of the studies reported time-to-event metrics for ML models. However, according to binary classification metrics, ML models demonstrated superior performance compared to traditional linear models. The oral cavity was the only site with enough articles to synthesize the results. For ML models, the AUROC ranged from 0.67 to 0.88, while for linear models such as logistic regression models, the AUROC ranged from 0.68 to 0.73 for the same studies [64,79,84]. In the context of disease progression, in contrast to the findings on survival outcomes, there was no consistent pattern favoring a particular model, such as tree-based models. Given the limited number of studies and the scarcity of research on certain outcomes, further investigation is essential before drawing definitive conclusions.
The analyzed studies encountered several limitations, notably in the lack of comprehensive details on the treatments provided, such as surgical methods and dose-volume histogram information, which can impact both the development and performance of models. Additionally, while the majority of the research focused on survival outcomes, particularly overall survival, less attention was given to disease-specific outcomes and aspects of disease progression such as recurrences and metastasis, which significantly influence patient prognosis. Despite the emphasis on adhering to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) reporting guidelines [92], only 11 out of 34 studies conducted time-to-event analyses and took into account the censored data. Moreover, the evaluation methods for the models predominantly focused on discrimination metrics, with only six studies including calibration to assess the alignment between predicted probabilities and actual outcomes, an essential component in model validation. Given the identified limitations, future research should focus on establishing more robust and transparent methodologies for ML modeling in HNC prognosis. It is imperative to conduct external validations to establish the generalizability of the models, which were notably scarce in the reviewed studies, with only four out of 34 studies undertaking this crucial step. Emphasizing the use of multicenter databases is also recommended to mitigate potential regional and demographic biases. Additionally, while ML models have shown promising results in HNC prognosis, there is a significant need for clinicians to be thoroughly educated on the nuances of each model, including their strengths, limitations, and biases. Such knowledge is crucial for clinicians to effectively leverage these models in clinical settings, promoting their broader adoption and integration into clinical practice.

Conclusion
ML models exhibit considerable potential in predicting post-treatment survival and progression in HNC patients. ML models consistently outperformed traditional linear models, such as logistic regression and Cox PH, as well as the nomograms derived from these models. Among ML models, DeepSurv followed by tree-based models demonstrated the highest performance. Regarding survival outcomes, models focusing on disease-specific outcomes achieved higher performance compared to those targeting overall survival, while there was no meaningful difference between follow-up durations of three and five years. There were fewer models for disease progression outcomes, with only one conducting time-to-event analyses. The studies generally lacked detailed incorporation of treatment specifics into their models, which could potentially improve model performance. Future research should integrate more comprehensive treatment data, place a greater emphasis on disease progression outcomes, and establish model generalizability through external validations and the utilization of multicenter datasets.

Fig 2. The risk of bias and applicability concerns of the included studies based on PROBAST. Abbreviations. ROB: risk of bias, PROBAST: Prediction model Risk Of Bias ASsessment Tool. https://doi.org/10.1371/journal.pone.0307531.g002